Final Project - Data Analytics¶

The final project Abschlussprojekt consists of two parts:

                                            Part 1 - Data analysis
                                            Part 2 - Dashboard

1. Data analysis¶

You work as a data scientist at the P-2-P platform https://www.kiva.org/, which was founded a year ago. Now you want to expand your business. Your team has split up and each analyst has a sub-area of the data. Your task is to find insights for your platform in an explorative data analysis.

Requirements:

            - Complete data preparation
            - Documentation of the work steps
            - Comprehensible and meaningful explanations of procedure/decisions
            - for the evaluation: name 3 different types of plots (plot functions)
            - Customizing of the plots
            - Analysis of the plots
            - coding


CRISP DM: Business Understanding¶

Your business model is to operate a platform (crowd-investing) where people who have a business idea but not the money needed can register and raise money for their project within a specified period of time.

On the other hand, you have investors who would like to invest their money in projects and are looking for investments.

As an intermediary, your platform brings borrowers and lenders together.

Your database is the history of your platform.

Additional assumptions about the business model

All projects are completed projects, i.e. the time to raise money for these projects has expired. Your business model provides for the collected funds to be paid out even if the target amount has not been reached.

You earn your money with a commission for every project that lands on your platform.

The backer receives interest for lending the money.

CRISP DM: Data Understanding¶

- funded_amount ... Amount received/disbursed at the end of the crowding period in USD
- loan_amount ... Target amount (amount you wanted to achieve for the project) in USD
- activity ... Subcategory to which the goal of the crowd project thematically belongs
- sector ... Top category in which the crowd project topic falls
- use ... Brief description of what the money is to be used for
- country_code ... Country code according to ISO standard
- country ... Country name according to ISO standard
- region ... region
- currency ... Currency in which the funded_amount was then paid out
- term in months ... Duration over which the loan is to be disbursed
- lender_count ...lender (i.e. how many people have given money for the project)
- borrower_genders ... Gender and number of borrowers, i.e. those who initiated the crowd project
- repayment interval ... Contractually agreed repayment modalities/rhythm

CRISP DM: Data Preparation¶

In [1]:
# load needed libraries

import pandas as pd 
import plotly_express as px 

Load data¶

In [2]:
# load data as a test to identify the correct separator

df_test = pd.read_csv('data_abschlussprojekt.csv',
                      sep='/n',
                      engine='python',
                      nrows=2
                      )

df_test
Out[2]:
# funded_amount# loan_amount# activity# sector# use# country_code# country# region# currency# term_in_months# lender_count# borrower_genders# repayment_interval
0 0#300.0#300.0#Fruits & Vegetables#Food#To buy ...
1 1#575.0#575.0#Rickshaw#Transportation#to repai...
In [3]:
# load data with the correct separator #

df_kiva = pd.read_csv('data_abschlussprojekt.csv',
                      sep='#',
                      engine='python',
                      index_col=0,
                      skipinitialspace=True
                      )

df_kiva
Out[3]:
funded_amount loan_amount activity sector use country_code country region currency term_in_months lender_count borrower_genders repayment_interval
0 300.0 300.0 Fruits & Vegetables Food To buy seasonal, fresh fruits to sell. PK Pakistan Lahore PKR 12.0 12 female irregular
1 575.0 575.0 Rickshaw Transportation to repair and maintain the auto rickshaw used ... PK Pakistan Lahore PKR 11.0 14 female, female irregular
2 150.0 150.0 Transportation Transportation To repair their old cycle-van and buy another ... IN India Maynaguri INR 43.0 6 female bullet
3 200.0 200.0 Embroidery Arts to purchase an embroidery machine and a variet... PK Pakistan Lahore PKR 11.0 8 female irregular
4 400.0 400.0 Milk Sales Food to purchase one buffalo. PK Pakistan Abdul Hakeem PKR 14.0 16 female monthly
... ... ... ... ... ... ... ... ... ... ... ... ... ...
671200 0.0 25.0 Livestock Agriculture [True, u'para compara: cemento, arenya y ladri... PY Paraguay Concepción USD 13.0 0 female monthly
671201 25.0 25.0 Livestock Agriculture [True, u'to start a turducken farm.'] - this l... KE Kenya NaN KES 13.0 1 female monthly
671202 0.0 25.0 Games Entertainment NaN KE Kenya NaN KES 13.0 0 NaN monthly
671203 0.0 25.0 Livestock Agriculture [True, u'to start a turducken farm.'] - this l... KE Kenya NaN KES 13.0 0 female monthly
671204 0.0 25.0 Livestock Agriculture [True, u'to start a turducken farm.'] - this l... KE Kenya NaN KES 13.0 0 female monthly

671205 rows × 13 columns

Getting to know data¶

Preliminary considerations on the data types and possible optimizations for later

- funded_amount --> float
- loan_amount --> float
- activity --> str --> cat?
- sector --> str --> cat?
- use --> str
- country_code --> str --> cat or delete?
- country --> str --> cat?
- region --> str
- currency --> str --> cat?
- term in months --> float
- lender_count --> int
- borrower_genders --> str  
- repayment interval --> str

--> later split the column borrower_genders into 2 columns --> borrower_female_count and borrower_male_count --> int

In [4]:
# show data types

df_kiva.dtypes
Out[4]:
funded_amount         float64
loan_amount           float64
activity               object
sector                 object
use                    object
country_code           object
country                object
region                 object
currency               object
term_in_months        float64
lender_count            int64
borrower_genders       object
repayment_interval     object
dtype: object

Evaluation of data types

The data types are as expected and consistent at first glance.

In [5]:
# show column names

df_kiva.columns
Out[5]:
Index(['funded_amount', 'loan_amount', 'activity', 'sector', 'use',
       'country_code', 'country', 'region', 'currency', 'term_in_months',
       'lender_count', 'borrower_genders', 'repayment_interval'],
      dtype='object')
In [6]:
# show general information about the DataFrame

df_kiva.info() 
<class 'pandas.core.frame.DataFrame'>
Int64Index: 671205 entries, 0 to 671204
Data columns (total 13 columns):
 #   Column              Non-Null Count   Dtype  
---  ------              --------------   -----  
 0   funded_amount       671205 non-null  float64
 1   loan_amount         671205 non-null  float64
 2   activity            671205 non-null  object 
 3   sector              671205 non-null  object 
 4   use                 666972 non-null  object 
 5   country_code        671197 non-null  object 
 6   country             671205 non-null  object 
 7   region              614405 non-null  object 
 8   currency            671205 non-null  object 
 9   term_in_months      671205 non-null  float64
 10  lender_count        671205 non-null  int64  
 11  borrower_genders    666984 non-null  object 
 12  repayment_interval  671205 non-null  object 
dtypes: float64(3), int64(1), object(9)
memory usage: 71.7+ MB

Evaluation

  • There are missing values in 4 columns, which will be examined later:
    • use
    • country_code
    • region
    • borrower_genders
  • storage space: 71.7+ MB
In [7]:
# first statistical evaluation

df_kiva.describe()
Out[7]:
funded_amount loan_amount term_in_months lender_count
count 671205.000000 671205.000000 671205.000000 671205.000000
mean 785.995061 842.397107 13.739022 20.590922
std 1130.398941 1198.660073 8.598919 28.459551
min 0.000000 25.000000 1.000000 0.000000
25% 250.000000 275.000000 8.000000 7.000000
50% 450.000000 500.000000 13.000000 13.000000
75% 900.000000 1000.000000 14.000000 24.000000
max 100000.000000 100000.000000 158.000000 2986.000000

Evaluation

The max values for funded_amount and loan_amount are the same and are both outside the value ranges. They will be examined in more detail later.

The max values of the other two columns term_in_months and lender_count are also outside their value ranges and will be examined in more detail later.

The other values appear to be plausible so far.

Duplicates¶

Definition of a duplicate

As data records can be very similar to a large extent due to the many columns with only a few unique values and no distinct values such as date and time, all columns will be used to detect a duplicate.

Identification

In [8]:
# identification of duplicates - all

df_kiva.loc[df_kiva.duplicated(subset=df_kiva,
                               keep=False)].sort_values(ascending=True, by=['funded_amount', 'loan_amount', 'use'])
Out[8]:
funded_amount loan_amount activity sector use country_code country region currency term_in_months lender_count borrower_genders repayment_interval
671167 0.0 25.0 Livestock Agriculture Kiva Coordinator fixed issue loan (no longer v... KE Kenya NaN KES 13.0 0 female, female monthly
671169 0.0 25.0 Livestock Agriculture Kiva Coordinator fixed issue loan (no longer v... KE Kenya NaN KES 13.0 0 female, female monthly
671194 0.0 25.0 Livestock Agriculture Kiva Coordinator fixed issue loan (no longer v... KE Kenya NaN KES 13.0 0 female, female monthly
671168 0.0 25.0 Livestock Agriculture Pretend the issue with loan got addressed by K... KE Kenya NaN KES 13.0 0 female monthly
671172 0.0 25.0 Livestock Agriculture Pretend the issue with loan got addressed by K... KE Kenya NaN KES 13.0 0 female monthly
... ... ... ... ... ... ... ... ... ... ... ... ... ...
542600 3600.0 3600.0 Poultry Agriculture To buy feed for chicken TH Thailand Chiang Rai THB 14.0 133 male bullet
542790 3600.0 3600.0 Poultry Agriculture To buy feed for chicken TH Thailand Chiang Rai THB 14.0 133 male bullet
542887 3600.0 3600.0 Poultry Agriculture To buy feed for chicken TH Thailand Chiang Rai THB 14.0 131 female bullet
542922 3600.0 3600.0 Poultry Agriculture To buy feed for chicken TH Thailand Chiang Rai THB 14.0 139 male bullet
543030 3600.0 3600.0 Poultry Agriculture To buy feed for chicken TH Thailand Chiang Rai THB 14.0 139 male bullet

34930 rows × 13 columns

In [9]:
# display only the values that are duplicated and should be deleted if necessary

df_kiva.loc[df_kiva.duplicated(subset=df_kiva,
                               keep='first')].sort_values(ascending=True, by=['funded_amount', 'loan_amount', 'use'])
Out[9]:
funded_amount loan_amount activity sector use country_code country region currency term_in_months lender_count borrower_genders repayment_interval
671169 0.0 25.0 Livestock Agriculture Kiva Coordinator fixed issue loan (no longer v... KE Kenya NaN KES 13.0 0 female, female monthly
671194 0.0 25.0 Livestock Agriculture Kiva Coordinator fixed issue loan (no longer v... KE Kenya NaN KES 13.0 0 female, female monthly
671172 0.0 25.0 Livestock Agriculture Pretend the issue with loan got addressed by K... KE Kenya NaN KES 13.0 0 female monthly
671197 0.0 25.0 Livestock Agriculture Pretend the issue with loan got addressed by K... KE Kenya NaN KES 13.0 0 female monthly
671199 0.0 25.0 Livestock Agriculture [True, u'para compara: cemento, arenya y ladri... PY Paraguay Concepción USD 13.0 0 female monthly
... ... ... ... ... ... ... ... ... ... ... ... ... ...
499778 3600.0 3600.0 Poultry Agriculture To buy feed for chicken TH Thailand Chiang Rai THB 11.0 131 female bullet
502232 3600.0 3600.0 Poultry Agriculture To buy feed for chicken TH Thailand Chiang Rai THB 11.0 130 male bullet
542790 3600.0 3600.0 Poultry Agriculture To buy feed for chicken TH Thailand Chiang Rai THB 14.0 133 male bullet
542887 3600.0 3600.0 Poultry Agriculture To buy feed for chicken TH Thailand Chiang Rai THB 14.0 131 female bullet
543030 3600.0 3600.0 Poultry Agriculture To buy feed for chicken TH Thailand Chiang Rai THB 14.0 139 male bullet

24588 rows × 13 columns

In [10]:
# percentage of duplicates

# number of lines in the entire DataFrame
count_rows = df_kiva.shape[0]
print(count_rows)

# number of duplicates
count_rows_duplicated = df_kiva.loc[df_kiva.duplicated(subset=df_kiva, keep='first'), 'funded_amount'].count()
print(count_rows_duplicated)

p = round((count_rows_duplicated/count_rows)*100, 2)
print('Prozentsatz:', p, '%')
671205
24588
Prozentsatz: 3.66 %
In [11]:
# display of eye-catching description texts for three examples

print(df_kiva.iloc[671169][4])
print(df_kiva.iloc[671172][4])
print(df_kiva.iloc[671199][4])
Kiva Coordinator fixed issue loan (no longer vague).
Pretend the issue with loan got addressed by Kiva Coordinator.
[True, u'para compara: cemento, arenya y ladriollo para construir una pila.'] - this loan use has been approved by VIVA QA

Handling

There are 34930 data records, each of which occurs at least twice. 24588 rows of these are pure duplicates and should be deleted if necessary - that is approx. 3.66 %.

As all columns were used to identify the duplicates, these data records are clearly to be regarded as duplicates. Theoretically, a data record corresponding to a crowd-investing project can presumably appear several times by the same person for the same purpose for the same amount of money. However, the relationship between the funded_amount, loan_amount and lender_count columns in particular will most likely be different each time.

The above three examples of description texts also suggest that these examples are edited or commented data records by Kiva employees. It is therefore possible that several versions of a project exist as different datasets due to certain edits, which could explain the number of duplicates found.

As only part of the data is available, a decision can currently only be made on the basis of this data.

Based on the above considerations, the duplicates will be deleted.

In [12]:
# deleting the duplicates

df_kiva.drop_duplicates(subset=df_kiva, inplace=True)
In [13]:
# reset index

df_kiva.reset_index(drop=True, inplace=True)
df_kiva
Out[13]:
funded_amount loan_amount activity sector use country_code country region currency term_in_months lender_count borrower_genders repayment_interval
0 300.0 300.0 Fruits & Vegetables Food To buy seasonal, fresh fruits to sell. PK Pakistan Lahore PKR 12.0 12 female irregular
1 575.0 575.0 Rickshaw Transportation to repair and maintain the auto rickshaw used ... PK Pakistan Lahore PKR 11.0 14 female, female irregular
2 150.0 150.0 Transportation Transportation To repair their old cycle-van and buy another ... IN India Maynaguri INR 43.0 6 female bullet
3 200.0 200.0 Embroidery Arts to purchase an embroidery machine and a variet... PK Pakistan Lahore PKR 11.0 8 female irregular
4 400.0 400.0 Milk Sales Food to purchase one buffalo. PK Pakistan Abdul Hakeem PKR 14.0 16 female monthly
... ... ... ... ... ... ... ... ... ... ... ... ... ...
646612 0.0 25.0 Livestock Agriculture [True, u'para compara: cemento, arenya y ladri... PY Paraguay Concepción USD 13.0 0 female monthly
646613 0.0 25.0 Livestock Agriculture Reviewed loan use in english. PK Pakistan Attock PKR 13.0 0 female monthly
646614 0.0 125.0 Livestock Agriculture Pretend the flagged issue was addressed by KC. MX Mexico Iztacalco MXN 13.0 0 female, female monthly
646615 0.0 875.0 Livestock Agriculture Translated loan use to english. BO Bolivia La Paz BOB 13.0 0 female, female monthly
646616 0.0 250.0 Livestock Agriculture Reviewed loan use in english. GH Ghana Dansoman GHS 13.0 0 female monthly

646617 rows × 13 columns

In [14]:
# check whether all duplicates have been removed

df_kiva.loc[df_kiva.duplicated(subset=df_kiva, keep='first')]
Out[14]:
funded_amount loan_amount activity sector use country_code country region currency term_in_months lender_count borrower_genders repayment_interval

Missing values¶

The following columns with missing values have already been identified:

  • use
  • country_code
  • region
  • borrower_genders

Identification

  • Recognize missing values
  • Identify placeholders via unique values or, if necessary, by searching for special characters
In [15]:
# Nullwerte identifizieren

df_kiva.isna().sum().sort_values(ascending=False)
Out[15]:
region                56158
use                    3900
borrower_genders       3888
country_code              8
funded_amount             0
loan_amount               0
activity                  0
sector                    0
country                   0
currency                  0
term_in_months            0
lender_count              0
repayment_interval        0
dtype: int64
In [16]:
# missing values in percentages

df_kiva.isna().mean().round(6).sort_values(ascending=False)*100
Out[16]:
region                8.6849
use                   0.6031
borrower_genders      0.6013
country_code          0.0012
funded_amount         0.0000
loan_amount           0.0000
activity              0.0000
sector                0.0000
country               0.0000
currency              0.0000
term_in_months        0.0000
lender_count          0.0000
repayment_interval    0.0000
dtype: float64
In [17]:
# find placeholders - display the number of unique values in each column

df_kiva.nunique()
Out[17]:
funded_amount            610
loan_amount              479
activity                 163
sector                    15
use                   423452
country_code              86
country                   87
region                 12695
currency                  67
term_in_months           148
lender_count             503
borrower_genders       11298
repayment_interval         4
dtype: int64
In [18]:
df_kiva.loc[:, 'activity'].unique()
Out[18]:
array(['Fruits & Vegetables', 'Rickshaw', 'Transportation', 'Embroidery',
       'Milk Sales', 'Services', 'Dairy', 'Beauty Salon', 'Manufacturing',
       'Food Production/Sales', 'Wholesale', 'General Store',
       'Clothing Sales', 'Poultry', 'Tailoring', 'Sewing', 'Bakery',
       'Restaurant', 'Food Stall', 'Farming', 'Construction Supplies',
       'Personal Products Sales', 'Home Products Sales',
       'Natural Medicines', 'Fish Selling', 'Education provider',
       'Shoe Sales', 'Machinery Rental', 'Butcher Shop', 'Pigs',
       'Personal Expenses', 'Food Market', 'Cosmetics Sales',
       'Personal Housing Expenses', 'Retail', 'Energy', 'Grocery Store',
       'Construction', 'Agriculture', 'Motorcycle Transport',
       'Charcoal Sales', 'Food', 'Pharmacy', 'Fishing', 'Timber Sales',
       'Cattle', 'Electronics Repair', 'Electronics Sales', 'Vehicle',
       'Cafe', 'Blacksmith', 'Higher education costs', 'Used Clothing',
       'Fuel/Firewood', 'Upholstery', 'Catering', 'Animal Sales',
       'Cereals', 'Vehicle Repairs', 'Arts',
       'Cloth & Dressmaking Supplies', 'Mobile Phones', 'Spare Parts',
       'Clothing', 'Metal Shop', 'Barber Shop', 'Furniture Making',
       'Crafts', 'Home Energy', 'Home Appliances', 'Wedding Expenses',
       'Taxi', 'Secretarial Services', 'Livestock', 'Property',
       'Recycling', 'Farm Supplies', 'Auto Repair', 'Beverages',
       'Plastics Sales', 'Electrical Goods', 'Carpentry', 'Photography',
       'Jewelry', 'Bricks', 'Pub', 'Phone Use Sales',
       'Water Distribution', 'Paper Sales', 'Computers',
       'Liquor Store / Off-License', 'Utilities', 'Knitting', 'Weaving',
       'Party Supplies', 'Medical Clinic', 'Internet Cafe',
       'Consumer Goods', 'Cement', 'Electrician',
       'Primary/secondary school costs', 'Veterinary Sales',
       'Land Rental', 'Laundry', 'Call Center', 'Perfumes', 'Hotel',
       'Motorcycle Repair', 'Movie Tapes & DVDs', 'Quarrying',
       'Personal Medical Expenses', 'Bookstore', 'Decorations Sales',
       'Recycled Materials', 'Office Supplies', 'Souvenir Sales',
       'Renewable Energy Products', 'Health', 'Printing', 'Phone Repair',
       'Traveling Sales', 'Flowers', 'Bicycle Repair', 'Entertainment',
       'Phone Accessories', 'Hardware', 'Used Shoes',
       'Music Discs & Tapes', 'Games', 'Balut-Making', 'Textiles',
       'Child Care', 'Goods Distribution', 'Florist', 'Cobbler', 'Dental',
       'Bookbinding', 'Cheese Making', 'Bicycle Sales', 'Well digging',
       'Technology', 'Musical Performance', 'Waste Management', 'Film',
       'Tourism', 'Musical Instruments', 'Religious Articles',
       'Machine Shop', 'Cleaning Services', 'Sporting Good Sales',
       'Patchwork', 'Funerals', 'Air Conditioning', 'Communications',
       'Adult Care', 'Landscaping / Gardening', 'Aquaculture',
       'Beekeeping', 'Event Planning', 'Celebrations', 'Computer',
       'Personal Care Products', 'Mobile Transactions'], dtype=object)
In [19]:
df_kiva.loc[:, 'sector'].unique()
Out[19]:
array(['Food', 'Transportation', 'Arts', 'Services', 'Agriculture',
       'Manufacturing', 'Wholesale', 'Retail', 'Clothing', 'Construction',
       'Health', 'Education', 'Personal Use', 'Housing', 'Entertainment'],
      dtype=object)
In [20]:
df_kiva.loc[:, 'country_code'].unique()
Out[20]:
array(['PK', 'IN', 'KE', 'NI', 'SV', 'TZ', 'PH', 'PE', 'SN', 'KH', 'LR',
       'VN', 'IQ', 'HN', 'PS', 'MN', 'US', 'ML', 'CO', 'TJ', 'GT', 'EC',
       'BO', 'YE', 'GH', 'SL', 'HT', 'CL', 'JO', 'UG', 'BI', 'BF', 'TL',
       'ID', 'GE', 'UA', 'XK', 'AL', 'CD', 'CR', 'SO', 'ZW', 'CM', 'TR',
       'AZ', 'DO', 'BR', 'MX', 'KG', 'AM', 'PY', 'LB', 'WS', 'IL', 'RW',
       'ZM', 'NP', 'CG', 'MZ', 'ZA', 'TG', 'BJ', 'BZ', 'SR', 'TH', 'NG',
       'MR', 'VU', 'PA', 'VI', 'VC', 'LA', 'MW', 'MM', 'MD', 'SS', 'SB',
       'CN', 'EG', 'GU', 'AF', 'MG', nan, 'PR', 'LS', 'CI', 'BT'],
      dtype=object)
In [21]:
df_kiva.loc[:, 'country'].unique()
Out[21]:
array(['Pakistan', 'India', 'Kenya', 'Nicaragua', 'El Salvador',
       'Tanzania', 'Philippines', 'Peru', 'Senegal', 'Cambodia',
       'Liberia', 'Vietnam', 'Iraq', 'Honduras', 'Palestine', 'Mongolia',
       'United States', 'Mali', 'Colombia', 'Tajikistan', 'Guatemala',
       'Ecuador', 'Bolivia', 'Yemen', 'Ghana', 'Sierra Leone', 'Haiti',
       'Chile', 'Jordan', 'Uganda', 'Burundi', 'Burkina Faso',
       'Timor-Leste', 'Indonesia', 'Georgia', 'Ukraine', 'Kosovo',
       'Albania', 'The Democratic Republic of the Congo', 'Costa Rica',
       'Somalia', 'Zimbabwe', 'Cameroon', 'Turkey', 'Azerbaijan',
       'Dominican Republic', 'Brazil', 'Mexico', 'Kyrgyzstan', 'Armenia',
       'Paraguay', 'Lebanon', 'Samoa', 'Israel', 'Rwanda', 'Zambia',
       'Nepal', 'Congo', 'Mozambique', 'South Africa', 'Togo', 'Benin',
       'Belize', 'Suriname', 'Thailand', 'Nigeria', 'Mauritania',
       'Vanuatu', 'Panama', 'Virgin Islands',
       'Saint Vincent and the Grenadines',
       "Lao People's Democratic Republic", 'Malawi', 'Myanmar (Burma)',
       'Moldova', 'South Sudan', 'Solomon Islands', 'China', 'Egypt',
       'Guam', 'Afghanistan', 'Madagascar', 'Namibia', 'Puerto Rico',
       'Lesotho', "Cote D'Ivoire", 'Bhutan'], dtype=object)
In [22]:
df_kiva.loc[:, 'currency'].unique()
Out[22]:
array(['PKR', 'INR', 'KES', 'NIO', 'USD', 'TZS', 'PHP', 'PEN', 'XOF',
       'LRD', 'VND', 'HNL', 'MNT', 'COP', 'GTQ', 'TJS', 'BOB', 'YER',
       'KHR', 'GHS', 'SLL', 'HTG', 'CLP', 'JOD', 'UGX', 'BIF', 'IDR',
       'GEL', 'UAH', 'EUR', 'ALL', 'CRC', 'XAF', 'TRY', 'AZN', 'DOP',
       'BRL', 'MXN', 'KGS', 'AMD', 'PYG', 'LBP', 'WST', 'ILS', 'RWF',
       'ZMW', 'NPR', 'MZN', 'ZAR', 'BZD', 'SRD', 'NGN', 'VUV', 'XCD',
       'MWK', 'LAK', 'MMK', 'ZWD', 'MDL', 'SSP', 'SBD', 'CNY', 'EGP',
       'MGA', 'NAD', 'LSL', 'THB'], dtype=object)
In [23]:
df_kiva.loc[:, 'term_in_months'].sort_values(ascending=True).unique()
Out[23]:
array([  1.,   2.,   3.,   4.,   5.,   6.,   7.,   8.,   9.,  10.,  11.,
        12.,  13.,  14.,  15.,  16.,  17.,  18.,  19.,  20.,  21.,  22.,
        23.,  24.,  25.,  26.,  27.,  28.,  29.,  30.,  31.,  32.,  33.,
        34.,  35.,  36.,  37.,  38.,  39.,  40.,  41.,  42.,  43.,  44.,
        45.,  46.,  47.,  48.,  49.,  50.,  51.,  52.,  53.,  54.,  55.,
        56.,  57.,  58.,  59.,  60.,  61.,  62.,  63.,  64.,  65.,  66.,
        67.,  68.,  69.,  70.,  71.,  72.,  73.,  74.,  75.,  76.,  77.,
        78.,  79.,  80.,  81.,  82.,  83.,  84.,  85.,  86.,  87.,  88.,
        89.,  90.,  91.,  92.,  93.,  94.,  95.,  96.,  97.,  98.,  99.,
       100., 101., 102., 103., 104., 105., 106., 107., 108., 109., 110.,
       111., 112., 113., 114., 115., 116., 118., 120., 121., 122., 123.,
       124., 125., 126., 127., 128., 129., 130., 131., 132., 133., 134.,
       135., 136., 137., 138., 139., 141., 142., 143., 144., 145., 146.,
       147., 148., 154., 156., 158.])

None of these values has a decimal place, the data type of the column can be converted to integer later on.

In [24]:
df_kiva.loc[:, 'repayment_interval'].unique()
Out[24]:
array(['irregular', 'bullet', 'monthly', 'weekly'], dtype=object)

Handling

country_code

The unique values in the column country_code and country show that the missing value corresponds to the country 'Namibia'. As the content of the country_code column matches that of the country column and therefore has no added value for the following analyses, the country_code column will be deleted. Filling in the values would be unnecessary and time-consuming.

In [25]:
# deleting the column country_code

df_kiva.drop(columns='country_code', inplace=True)
In [26]:
# check that the column was deleted

df_kiva.columns
Out[26]:
Index(['funded_amount', 'loan_amount', 'activity', 'sector', 'use', 'country',
       'region', 'currency', 'term_in_months', 'lender_count',
       'borrower_genders', 'repayment_interval'],
      dtype='object')

use

As the use column contains a short description (short text) and only explains the information in the sector and activity columns in more detail, it will be deleted as well.

However, this column could be interesting if it is examined for patterns using text mining. In case this becomes relevant in the subsequent analysis, a new DataFrame will be created so that the old one including this column is retained.

In [27]:
# deleting the column use

df_kiva2 = df_kiva.drop(columns='use')
In [28]:
# check that the column was deleted

df_kiva2.columns
Out[28]:
Index(['funded_amount', 'loan_amount', 'activity', 'sector', 'country',
       'region', 'currency', 'term_in_months', 'lender_count',
       'borrower_genders', 'repayment_interval'],
      dtype='object')

borrower_genders

This column contains data that will cause problems in subsequent analyses or at the latest for the EDA. For example, one cell contains 'female, female', so it contains the gender of each person who is a borrower. In fact, this is basically a listing.

It is assumed that the unique values are female and male, which are now split into two columns, each containing the number of people as an integer:

  • borrower_female_count
  • borrower_male_count

If there are further values, a provisional column (borrower_rest_count) is created for this, which is checked at the end and then deleted.

On this basis, the system checks what the respective unique values are and how they are distributed. The final handling is decided on the basis of this.

In [29]:
# Create new columns and fill them with the number zero

df_kiva2.loc[:, 'borrower_female_count'] = 0
df_kiva2.loc[:, 'borrower_male_count'] = 0
df_kiva2.loc[:, 'borrower_rest_count'] = 0
In [30]:
df_kiva2.head()
Out[30]:
funded_amount loan_amount activity sector country region currency term_in_months lender_count borrower_genders repayment_interval borrower_female_count borrower_male_count borrower_rest_count
0 300.0 300.0 Fruits & Vegetables Food Pakistan Lahore PKR 12.0 12 female irregular 0 0 0
1 575.0 575.0 Rickshaw Transportation Pakistan Lahore PKR 11.0 14 female, female irregular 0 0 0
2 150.0 150.0 Transportation Transportation India Maynaguri INR 43.0 6 female bullet 0 0 0
3 200.0 200.0 Embroidery Arts Pakistan Lahore PKR 11.0 8 female irregular 0 0 0
4 400.0 400.0 Milk Sales Food Pakistan Abdul Hakeem PKR 14.0 16 female monthly 0 0 0
In [31]:
# How do I get my object? - Testing Area

borrowers = df_kiva2.iloc[(1, 9)]
borrowers
Out[31]:
'female, female'
In [32]:
# How do I get my individual values? - Testing Area

# separate the values at the comma with a space
gender = borrowers.strip().split(', ')
print(gender)  # The result is a list with strings

# single value
gender[0]
['female', 'female']
Out[32]:
'female'
In [33]:
# cell has only one value in the example female, maximum length is 6 - Testing Area

len(df_kiva2.iloc[(0, 9)])
Out[33]:
6
In [34]:
# Write number of values in new column - Testing Area

# for only one value
# df_kiva2.iloc[(0, 11)] = 1 # if female, for male [12]
df_kiva2.head() 

# For multiple values, i.e. a list
list = df_kiva2.iloc[(1, 9)].strip().split(', ') # Specify the length of the list, so count the objects in the list
list.count('female') # Number of the string female within the list

# for missing values
# df_kiva2.iloc[(1, 11)] = 0
# df_kiva2.head()

pd.isna(df_kiva2.iloc[(140, 9)])  # example for np.nan in row 140
Out[34]:
True
In [35]:
# if the value is not a string - Testing Area

type(df_kiva2.iloc[(0, 9)]) != str
Out[35]:
False
In [36]:
# Checking for np.nan via isna() does not seem to work, so the values are replaced by -1

#df_kiva2.fillna({'borrower_genders': -1}, inplace=True)
#df_kiva2.iloc[140, 9] == -1
#type(df_kiva2.iloc[140, 9]) == int
type(df_kiva2.iloc[140, 9]) == float
pd.isna(df_kiva2.iloc[140, 9])
Out[36]:
True
In [37]:
# for-loop to split the column borrower_genders into new columns
# Go through all rows in the column with index 9

rest = {} # if there are values that are neither string nor 'nan', they are written further down into a dict

for index in range(len(df_kiva2)):
    
    # if it is a float
    if type(df_kiva2.iloc[(index, 9)]) == float:
        
        # if it is a np.nan, write the same into the columns borrower_male_count and borrower_female_count
        if pd.isna(df_kiva2.iloc[(index, 9)]):
            df_kiva2.iloc[(index, 11)] = df_kiva2.iloc[(index, 9)]
            df_kiva2.iloc[(index, 12)] = df_kiva2.iloc[(index, 9)]
        
        # if it is a float (and not a np.nan), write the value into the column borrower_rest_count
        else:
            df_kiva2.iloc[(index, 13)] = df_kiva2.iloc[(index, 9)]
        
            # and put the index as a string together with the value into the dictionary rest
            rest[str(index)] = df_kiva2.iloc[(index, 9)]
        
    # if it is a string
    if type(df_kiva2.iloc[(index, 9)]) == str:
    
        # if the length is 4, write 1 into column borrower_male_count
        if len(df_kiva2.iloc[(index, 9)]) == 4:
            df_kiva2.iloc[(index, 12)] = 1 

        # if the length is 6, write 1 into column borrower_female_count
        if len(df_kiva2.iloc[(index, 9)]) == 6:
            df_kiva2.iloc[(index, 11)] = 1

        # otherwise separate the values at comma with a space and save this in the variable gender_list
        else:
            gender_list = df_kiva2.iloc[(index, 9)].strip().split(', ')

            # calculate the number of male and female and write the number in the appropriate column
            count_male = gender_list.count('male')
            df_kiva2.iloc[(index, 12)] = count_male
            count_female = gender_list.count('female')
            df_kiva2.iloc[(index, 11)] = count_female
        
    # otherwise put the value into column borrower_rest_count
    #else:
    #    df_kiva2.iloc[(index, 13)] = df_kiva2.iloc[(index, 9)]
        
        # and put the index as a string together with the value into the dictionary rest
    #    rest[str(index)] = df_kiva2.iloc[(index, 9)]

print('Done!')        
Done!
In [38]:
# Check if everything worked
print(rest)
print(df_kiva2.iloc[140, 9])  # check whether nan value has been transferred correctly
df_kiva2.tail()
{}
nan
Out[38]:
funded_amount loan_amount activity sector country region currency term_in_months lender_count borrower_genders repayment_interval borrower_female_count borrower_male_count borrower_rest_count
646612 0.0 25.0 Livestock Agriculture Paraguay Concepción USD 13.0 0 female monthly 1.0 0.0 0
646613 0.0 25.0 Livestock Agriculture Pakistan Attock PKR 13.0 0 female monthly 1.0 0.0 0
646614 0.0 125.0 Livestock Agriculture Mexico Iztacalco MXN 13.0 0 female, female monthly 2.0 0.0 0
646615 0.0 875.0 Livestock Agriculture Bolivia La Paz BOB 13.0 0 female, female monthly 2.0 0.0 0
646616 0.0 250.0 Livestock Agriculture Ghana Dansoman GHS 13.0 0 female monthly 1.0 0.0 0

The dictionary rest and therefore also the column borrower_rest_count are empty, so there were no unexpected values.

In [39]:
# check unique values of female borrowers

df_kiva2.loc[:, 'borrower_female_count'].unique()
Out[39]:
array([ 1.,  2.,  3.,  8.,  0.,  9.,  5.,  4., 13., 12.,  7., nan, 10.,
        6., 21., 18., 30., 11., 27., 31., 34., 23., 22., 15., 17., 32.,
       36., 19., 14., 25., 20., 33., 40., 24., 26., 39., 28., 16., 35.,
       38., 29., 37., 45., 44., 50., 43., 48., 41., 42., 49.])
In [40]:
# check unique values of male borrowers

df_kiva2.loc[:, 'borrower_male_count'].unique()
Out[40]:
array([ 0.,  1.,  5.,  2.,  3., nan,  8.,  4., 10.,  7.,  9.,  6., 25.,
       13., 12., 11., 18., 16., 21., 24., 15., 19., 23., 31., 17., 26.,
       14., 22., 20., 28., 29., 33., 36., 35., 32., 34., 40., 27., 44.,
       39., 30.])
In [41]:
# check unique values of the column with the rest

df_kiva2.loc[:, 'borrower_rest_count'].unique()
Out[41]:
array([0], dtype=int64)

The dictionary rest and therefore also the borrower_rest_count column are empty, so there were no unexpected values. The column and the borrower_genders column, which is no longer required, will be deleted below.

In [42]:
# deleting the columns borrower_rest_count and borrower_genders

df_kiva2.drop(columns=['borrower_genders', 'borrower_rest_count'], inplace=True)
df_kiva2.columns
Out[42]:
Index(['funded_amount', 'loan_amount', 'activity', 'sector', 'country',
       'region', 'currency', 'term_in_months', 'lender_count',
       'repayment_interval', 'borrower_female_count', 'borrower_male_count'],
      dtype='object')
In [43]:
# count of each unique value in borrower_male_count

df_kiva2.loc[:, 'borrower_male_count'].value_counts()
Out[43]:
0.0     470903
1.0     144126
2.0       8130
3.0       5567
4.0       3830
5.0       3077
6.0       2176
7.0       1514
8.0       1115
9.0        719
10.0       509
11.0       308
12.0       192
13.0       137
14.0        89
15.0        62
16.0        49
17.0        42
18.0        34
19.0        28
20.0        21
22.0        19
21.0        13
25.0        13
23.0        12
24.0         8
27.0         6
26.0         6
28.0         5
36.0         3
33.0         3
29.0         3
31.0         3
35.0         1
32.0         1
34.0         1
40.0         1
44.0         1
39.0         1
30.0         1
Name: borrower_male_count, dtype: int64
In [44]:
# count of each unique value in borrower_female_count 

df_kiva2.loc[:, 'borrower_female_count'].value_counts() 
Out[44]:
1.0     420065
0.0     133307
3.0      15054
2.0      15047
4.0      12168
5.0       9420
7.0       5789
6.0       5581
8.0       4370
10.0      3359
9.0       3074
17.0      1888
11.0      1607
15.0      1511
18.0      1205
12.0      1181
16.0      1079
19.0       925
13.0       860
20.0       858
14.0       723
21.0       590
22.0       490
23.0       440
25.0       355
24.0       349
26.0       224
27.0       201
28.0       172
30.0       157
29.0       117
31.0       102
33.0        75
32.0        71
34.0        55
37.0        52
35.0        51
36.0        45
38.0        26
40.0        18
39.0        16
42.0        15
41.0        11
43.0         9
44.0         5
48.0         5
45.0         4
50.0         2
49.0         1
Name: borrower_female_count, dtype: int64
In [45]:
# sum of current missing values

df_kiva2.isna().sum()
Out[45]:
funded_amount                0
loan_amount                  0
activity                     0
sector                       0
country                      0
region                   56158
currency                     0
term_in_months               0
lender_count                 0
repayment_interval           0
borrower_female_count     3888
borrower_male_count       3888
dtype: int64

How to further deal with missing values:

In both columns, the most common values by far are 1 and 0. In order to be able to continue working with this data, the missing values will be filled with the respective mean value in dependence on another column.

The sector column will be used for this, as it describes the content of the project and it is to be expected that gender is related to this. In addition, the column has relatively few unique values compared to other columns, so that the filling will be relatively quick.

In [46]:
# Group the two columns with the number of genders according to the sector column and calculate the mean value in each case
# round the result

gender_group = df_kiva2.groupby(by=['sector'], 
                                as_index=False, 
                                observed=True).agg({'borrower_female_count': 'mean', 'borrower_male_count': 'mean'}).round()
gender_group
Out[46]:
sector borrower_female_count borrower_male_count
0 Agriculture 1.0 1.0
1 Arts 2.0 0.0
2 Clothing 2.0 0.0
3 Construction 1.0 1.0
4 Education 1.0 0.0
5 Entertainment 1.0 1.0
6 Food 2.0 0.0
7 Health 1.0 0.0
8 Housing 1.0 0.0
9 Manufacturing 1.0 0.0
10 Personal Use 2.0 1.0
11 Retail 2.0 0.0
12 Services 1.0 0.0
13 Transportation 1.0 1.0
14 Wholesale 1.0 1.0
In [47]:
# testing area

test = (gender_group.loc[gender_group.loc[:, 'sector'] == 'Food', 'borrower_female_count'].item())  # should be 2.0
test
type(test)  # should be a float
Out[47]:
float
In [48]:
# fill the nan values depending on the column sector - borrower_female_count

for sector in df_kiva2.loc[df_kiva2.loc[:, 'borrower_female_count'].isna(), 'sector'].unique():
    female_count = (gender_group.loc[gender_group.loc[:, 'sector'] == sector, 'borrower_female_count'].item())
    
    df_kiva2.loc[(df_kiva2.loc[:, 'sector'] == sector) & (df_kiva2.loc[:, 'borrower_female_count'].isna()), 'borrower_female_count'] = female_count

df_kiva2.isna().sum()
Out[48]:
funded_amount                0
loan_amount                  0
activity                     0
sector                       0
country                      0
region                   56158
currency                     0
term_in_months               0
lender_count                 0
repayment_interval           0
borrower_female_count        0
borrower_male_count       3888
dtype: int64
In [49]:
# fill the nan values depending on the column sector - borrower_male_count

for sector in df_kiva2.loc[df_kiva2.loc[:, 'borrower_male_count'].isna(), 'sector'].unique():
    male_count = (gender_group.loc[gender_group.loc[:, 'sector'] == sector, 'borrower_male_count'].item())
    df_kiva2.loc[(df_kiva2.loc[:, 'sector'] == sector) & (df_kiva2.loc[:, 'borrower_male_count'].isna()), 'borrower_male_count'] = male_count

df_kiva2.isna().sum()
Out[49]:
funded_amount                0
loan_amount                  0
activity                     0
sector                       0
country                      0
region                   56158
currency                     0
term_in_months               0
lender_count                 0
repayment_interval           0
borrower_female_count        0
borrower_male_count          0
dtype: int64
In [50]:
# Outdated code!!!
# filling the nan values with the respective mean value

#f_mean = df_kiva2.loc[:, 'borrower_female_count'].mean().round()
#m_mean = df_kiva2.loc[:, 'borrower_male_count'].mean().round()

#df_kiva2.loc[:, 'borrower_female_count'] = df_kiva2.loc[:, 'borrower_female_count'].fillna(value=f_mean)
#df_kiva2.loc[:, 'borrower_male_count'] = df_kiva2.loc[:, 'borrower_male_count'].fillna(value=m_mean)

# check
#df_kiva2.isna().sum() 

region

This column is missing the most values with approx. 8.7 %. As the region is dependent on the country and there is therefore a large number of unique values, it would take too much time to fill in the missing values in dependency. In addition, according to the information from the data dictionary, it is not clear whether this could also involve several regions.

As the data is still to be used, the missing region will be filled in with the country due to its dependency on it.

In [51]:
# exact number of unique values in the region column

df_kiva2.loc[:, 'region'].nunique()
Out[51]:
12695
In [52]:
# testing area

df_kiva2.loc[df_kiva2.loc[:, 'region'].isna(), 'country'].unique()

#country = 'Kenya'
#df_kiva2.loc[(df_kiva2.loc[:, 'country'] == country) & (df_kiva2.loc[:, 'region'].isna()), 'region'] = 'Kenya'
#df_kiva2.loc[df_kiva2.loc[:, 'region'] == 'Kenya']

# Are there already data sets where country equals region?
df_kiva2.loc[df_kiva2.loc[:, 'region'] == df_kiva2.loc[:, 'country']]
Out[52]:
funded_amount loan_amount activity sector country region currency term_in_months lender_count repayment_interval borrower_female_count borrower_male_count
41414 1300.0 1300.0 Livestock Agriculture Guatemala Guatemala GTQ 10.0 47 monthly 2.0 0.0
165448 1850.0 1850.0 Cattle Agriculture Guatemala Guatemala GTQ 14.0 27 monthly 2.0 0.0
318067 1050.0 1050.0 Bakery Food Guatemala Guatemala GTQ 8.0 40 monthly 4.0 0.0
332133 525.0 525.0 Bookstore Retail Guatemala Guatemala GTQ 6.0 20 irregular 2.0 0.0
336960 400.0 400.0 Grocery Store Food Guatemala Guatemala GTQ 8.0 16 monthly 2.0 0.0
356698 1850.0 1850.0 Pharmacy Health Guatemala Guatemala GTQ 14.0 70 monthly 0.0 2.0
494261 12700.0 12700.0 Education provider Education Guatemala Guatemala USD 25.0 409 irregular 1.0 0.0
509027 10000.0 10000.0 Recycled Materials Retail Guatemala Guatemala USD 25.0 340 irregular 0.0 1.0
534338 1875.0 1875.0 Pharmacy Health Guatemala Guatemala GTQ 14.0 56 monthly 0.0 2.0
559627 13000.0 13000.0 Services Services Guatemala Guatemala USD 25.0 380 irregular 0.0 1.0
567792 1175.0 1175.0 Grocery Store Food Guatemala Guatemala GTQ 8.0 46 monthly 1.0 1.0
569267 475.0 475.0 Food Stall Food Guatemala Guatemala GTQ 6.0 9 irregular 2.0 0.0
586750 1725.0 1725.0 Clothing Sales Clothing Guatemala Guatemala GTQ 8.0 30 monthly 7.0 0.0
599186 550.0 550.0 Services Services Guatemala Guatemala GTQ 6.0 22 irregular 0.0 2.0
623756 14300.0 14300.0 Textiles Arts Guatemala Guatemala USD 24.0 521 irregular 0.0 1.0

There are already 15 data records that have the same value in country and region. This is only the case for Guatemala.

In [53]:
# fill in the missing values with the value from column country

for country in df_kiva2.loc[df_kiva2.loc[:, 'region'].isna(), 'country'].unique():
    df_kiva2.loc[(df_kiva2.loc[:, 'country'] == country) & (df_kiva2.loc[:, 'region'].isna()), 'region'] = country
In [54]:
# Check change - are there any missing values?

df_kiva2.loc[df_kiva2.loc[:, 'region'].isna()]
Out[54]:
funded_amount loan_amount activity sector country region currency term_in_months lender_count repayment_interval borrower_female_count borrower_male_count
In [55]:
# Check change - where are region and country the same?

df_kiva2.loc[df_kiva2.loc[:, 'region'] == df_kiva2.loc[:, 'country']]
Out[55]:
funded_amount loan_amount activity sector country region currency term_in_months lender_count repayment_interval borrower_female_count borrower_male_count
5 250.0 250.0 Services Services Kenya Kenya KES 4.0 6 irregular 1.0 0.0
49 450.0 450.0 General Store Retail El Salvador El Salvador USD 14.0 18 monthly 0.0 1.0
54 225.0 225.0 Food Market Food Senegal Senegal XOF 14.0 7 monthly 1.0 0.0
67 125.0 125.0 Energy Services Kenya Kenya KES 3.0 6 irregular 0.0 1.0
70 2000.0 2000.0 Retail Retail Iraq Iraq USD 15.0 71 monthly 0.0 1.0
... ... ... ... ... ... ... ... ... ... ... ... ...
646606 25.0 25.0 Livestock Agriculture Kenya Kenya KES 13.0 1 monthly 1.0 0.0
646607 0.0 25.0 Livestock Agriculture Kenya Kenya KES 13.0 0 monthly 2.0 0.0
646608 0.0 25.0 Livestock Agriculture Kenya Kenya KES 13.0 0 monthly 1.0 0.0
646609 0.0 25.0 Games Entertainment Kenya Kenya KES 13.0 0 monthly 2.0 0.0
646610 0.0 25.0 Games Entertainment Kenya Kenya KES 13.0 0 monthly 1.0 1.0

56173 rows × 12 columns

Outliers vs. extreme values¶

Already noted above:

The max values for funded_amount and loan_amount are the same and are both outside the value ranges. They will be examined in more detail later.

The max values of the other two columns term_in_months and lender_count are also outside their value ranges and will be examined in more detail later as well.

Identification

In [56]:
# short statistics

df_kiva2.describe()
Out[56]:
funded_amount loan_amount term_in_months lender_count borrower_female_count borrower_male_count
count 646617.000000 646617.000000 646617.000000 646617.000000 646617.000000 646617.000000
mean 808.045242 866.458816 13.719826 21.115360 1.613281 0.415136
std 1145.419041 1214.276106 8.479912 28.840674 3.077798 1.132761
min 0.000000 25.000000 1.000000 0.000000 0.000000 0.000000
25% 275.000000 275.000000 8.000000 7.000000 1.000000 0.000000
50% 475.000000 500.000000 13.000000 13.000000 1.000000 0.000000
75% 925.000000 1000.000000 14.000000 25.000000 1.000000 1.000000
max 100000.000000 100000.000000 158.000000 2986.000000 50.000000 44.000000
In [57]:
# short visualization via histogram - funded_amount and loan_amount

hist_funded_loan = px.histogram(data_frame=df_kiva2.loc[:, ['funded_amount', 'loan_amount']],
                         y=['funded_amount', 'loan_amount'],
                         title='Distribution of data on target amount and amount actually achieved in USD'
                         )

hist_funded_loan.show()
In [58]:
# short visualization via histogram - term_in_months

hist_months = px.histogram(data_frame=df_kiva2.loc[:, 'term_in_months'],
                         y='term_in_months',
                         title='Distribution of data on the duration (in months) over which the amount is paid out'
                         )

hist_months.show()
In [59]:
# short visualization via histogram - lender_count

hist_lender = px.histogram(data_frame=df_kiva2.loc[:, 'lender_count'],
                         y='lender_count',
                         title='Distribution of the number of lenders'
                         )

hist_lender.show()
In [60]:
# short visualization via histogram - borrower_female_count and borrower_male_count

hist_borrower = px.histogram(data_frame=df_kiva2.loc[:, ['borrower_female_count', 'borrower_male_count']],
                         y=['borrower_female_count', 'borrower_male_count'],
                         title='Distribution of the number of borrowers, divided by gender'
                         )

hist_borrower.show()

Handling

All six columns that contain numerical values have maximum values that can be recognized as extreme values.

Columns:

  • funded_amount
  • loan_amount
  • term_in_months
  • lender_count
  • borrower_female_count
  • borrower_male_count

Due to the context of each of these columns, it can be assumed that these are not outliers and the data should therefore be retained. This is because important data for our business could be lost in the event of deletion.

As a result, in the following process the median will be used instead of the mean.

Feature Engineering¶

The columns funded_amount (amount disbursed) and loan_amount (target amount) can be used to calculate the percentage of the target amount that has been reached: reached_amount_pct

This also contains the information as to whether the target amount has been reached - at 100%.

reached_amount_pct = (funded_amount / loan_amount) * 100

In [61]:
# creation of the new column reached_amount_pct

df_kiva2.loc[:, 'reached_amount_pct'] = (df_kiva2.loc[:, 'funded_amount'] / df_kiva2.loc[:, 'loan_amount']) * 100

# check
df_kiva2.head()
Out[61]:
funded_amount loan_amount activity sector country region currency term_in_months lender_count repayment_interval borrower_female_count borrower_male_count reached_amount_pct
0 300.0 300.0 Fruits & Vegetables Food Pakistan Lahore PKR 12.0 12 irregular 1.0 0.0 100.0
1 575.0 575.0 Rickshaw Transportation Pakistan Lahore PKR 11.0 14 irregular 2.0 0.0 100.0
2 150.0 150.0 Transportation Transportation India Maynaguri INR 43.0 6 bullet 1.0 0.0 100.0
3 200.0 200.0 Embroidery Arts Pakistan Lahore PKR 11.0 8 irregular 1.0 0.0 100.0
4 400.0 400.0 Milk Sales Food Pakistan Abdul Hakeem PKR 14.0 16 monthly 1.0 0.0 100.0

Optimize storage space¶

The following optimizations regarding the data type are useful or should be checked:

Change

  • sector --> str --> cat
  • country --> str --> cat
  • currency --> str --> cat
  • term_in_months --> float --> int32
  • lender_count --> int64 --> int32
  • repayment_interval --> str --> cat
  • borrower_female_count --> float --> int32
  • borrower_male_count --> float --> int32

--> int32 will be used, as the maximum values do not exceed its range (-2,147,483,648 to +2,147,483,647).

Check

  • activity --> str --> cat // is the conversion worthwhile?
  • region --> str --> cat // is the conversion worthwhile?
  • reached_amount_pct --> float --> int // are there decimal places?
In [62]:
# show all data types again for an overview

df_kiva2.dtypes
Out[62]:
funded_amount            float64
loan_amount              float64
activity                  object
sector                    object
country                   object
region                    object
currency                  object
term_in_months           float64
lender_count               int64
repayment_interval        object
borrower_female_count    float64
borrower_male_count      float64
reached_amount_pct       float64
dtype: object
In [63]:
# How much memory space do the data set and the individual columns take up?

df_kiva2.memory_usage()
df_kiva2.memory_usage().sum()
Out[63]:
67248296

Sum of the current storage space: 67248296 bytes

In [64]:
# conversion of the corresponding columns to category type

to_cat = ['sector', 'country', 'currency', 'repayment_interval']
df_kiva2[to_cat]  = df_kiva2.loc[:, to_cat].astype('category')
In [65]:
# check category types

df_kiva2.dtypes
Out[65]:
funded_amount             float64
loan_amount               float64
activity                   object
sector                   category
country                  category
region                     object
currency                 category
term_in_months            float64
lender_count                int64
repayment_interval       category
borrower_female_count     float64
borrower_male_count       float64
reached_amount_pct        float64
dtype: object
In [66]:
# conversion of the corresponding columns to int type

to_int = ['lender_count', 'term_in_months', 'borrower_female_count', 'borrower_male_count']
df_kiva2[to_int]  = df_kiva2.loc[:, to_int].astype('int32')
In [67]:
# check int types

df_kiva2.dtypes
Out[67]:
funded_amount             float64
loan_amount               float64
activity                   object
sector                   category
country                  category
region                     object
currency                 category
term_in_months              int32
lender_count                int32
repayment_interval       category
borrower_female_count       int32
borrower_male_count         int32
reached_amount_pct        float64
dtype: object
In [68]:
# Is it worth converting the columns activity and region to category type?

to_cat2 = ['activity', 'region']
df_kiva2[to_cat2]  = df_kiva2.loc[:, to_cat2].astype('category')
In [69]:
# check the type

df_kiva2.dtypes
Out[69]:
funded_amount             float64
loan_amount               float64
activity                 category
sector                   category
country                  category
region                   category
currency                 category
term_in_months              int32
lender_count                int32
repayment_interval       category
borrower_female_count       int32
borrower_male_count         int32
reached_amount_pct        float64
dtype: object
In [70]:
# check storage space

df_kiva2.memory_usage()
Out[70]:
Index                        128
funded_amount            5172936
loan_amount              5172936
activity                 1298706
sector                    647293
country                   649417
region                   1923770
currency                  649257
term_in_months           2586468
lender_count             2586468
repayment_interval        646821
borrower_female_count    2586468
borrower_male_count      2586468
reached_amount_pct       5172936
dtype: int64
In [71]:
# Check - does the reached_amount_pct column have decimal places?

df_kiva2.loc[:, 'reached_amount_pct'].unique()
Out[71]:
array([100.        ,  85.5       ,  80.20833333, ...,  52.21238938,
        49.66442953,   8.97435897])
In [72]:
# Display the amount disbursed, which is more than 100 %

df_kiva2.loc[df_kiva2.loc[:, 'reached_amount_pct'] > 100]
Out[72]:
funded_amount loan_amount activity sector country region currency term_in_months lender_count repayment_interval borrower_female_count borrower_male_count reached_amount_pct
272078 425.0 400.0 General Store Retail Mozambique Boane, Maputo MZN 17 11 monthly 0 1 106.250000
331191 3400.0 3000.0 Farm Supplies Agriculture Armenia Hoktember village, Armavir region USD 38 84 monthly 0 1 113.333333

There are two projects that have received more than the target amount. How is that possible?

In [73]:
# sum of the memory space in bytes after adjusting the data types

df_kiva2.memory_usage().sum()
Out[73]:
31680072

The optimization of the data types has paid off, the memory space is now slightly less than half as much as before.

  • Before: 67 248 296 bytes
  • After: 31 680 072 bytes

Save data¶

Saving the dataset as a pickle file for backup and the following work on EDA and creating the dashboard.

In [74]:
df_kiva2.to_pickle('./01_data_preparation.pkl')

The following file will handle the exploratory data analysis of this dataset: abschlussprojekt_EDA.ipynb